class: center, middle, inverse, title-slide

.title[
# FIN7028: Times Series Financial Econometrics
]
.subtitle[
## Exploring data
]
.author[
### Barry Quinn
]
.date[
### 2023-02-16
]

---
layout: true

<div class="my-footer">
<span>
Barry Quinn CStat
</span>
</div>

---

.acid[
.hand[Learning Outcomes]

- Visualise arguments with data
- Access data programmatically
- Fake data, simulation and uncertainty
- Statistically engineered robots in finance
- Visual trends and patterns
- Criticise poor visualisations
]

---
class: middle

# Rethinking visualisation

.fat[
- Charts are not meant to be *seen*, they are intended to be *read*
- They are not just images but **visual arguments**.
- Data doesn't *speak for itself*.
- Data visualisations need to be **shown** and **explained**
]

---
class: middle

## `ts` objects and `ts` function

A time series is stored in a `ts` object in R:

- a list of numbers
- information about times those numbers were recorded.

### Example

| Year| Observation|
|----:|-----------:|
| 2012|         123|
| 2013|          39|
| 2014|          78|
| 2015|          52|
| 2016|         110|

```r
y <- ts(c(123,39,78,52,110), start=2012)
```

---
class: middle

## `ts` objects and `ts` function

.pull-left[
For observations that are more frequent than once per year, add a `frequency` argument.
E.g., monthly data stored as a numerical vector `z`:

```r
y <- ts(z, frequency=12, start=c(2003, 1))
```
]

.pull-right[
`ts(data, frequency, start)`

|Type of data|frequency|start example|
|:--:|:--:|:--:|
|Annual|1|1995|
|Quarterly|4|`c(1995,2)`|
|Monthly|12|`c(1995,9)`|
|Daily|7 or 365.25|1 or `c(1995,234)`|
|Weekly|52.18|`c(1995,23)`|
|Hourly|24 or 168 or 8,766|1|
|Half-hourly|48 or 336 or 17,532|1|
]

---
class: middle

## Class package (pre-loaded in Q-RaP)

.pull-left[
`library(tidyverse); library(tidyquant); library(DT)`

`DT` loads **datatable** for interactive table visualisation

`tidyquant` loads:

* **tq_get** function (for getting data from online sources)
* **tq_transmute** function (for transforming between time frequencies)
* **tq_mutate** function (create return series)

`tidyverse` loads many packages, including:

* **ggplot2** plotting package
* **dplyr** data wrangling package
]

.pull-right[
```
library(fpp2)
```

This loads:

* **forecast** package (for forecasting functions)
* **fma** package (for lots of time series data)
* **expsmooth** package (for more time series data)
]

---
class: middle

## Some class data

```r
# Good practice to clear all objects before loading data
MyDeleteItems <- ls()
rm(list = MyDeleteItems)
library(tsfe) # includes class datasets
data(package = 'tsfe')
```

---
class: middle

## Programmatically accessing data from the
.pull-left[
* Using WRDS (more later)
* Download data using `tq_get`
* Create a monthly series using `tq_transmute`

```r
# Download daily FTSE 100 index data from 2016 onwards
ftse <- tq_get("^FTSE", from = "2016-01-01")
# Convert the daily adjusted prices to a monthly series
ftse_m <- tq_transmute(ftse, select = adjusted, mutate_fun = to.monthly)
# Store as a monthly ts object starting in January 2016
ftse_m_ts <- ts(ftse_m$adjusted, start = c(2016, 1), frequency = 12)
```
]

.pull-right[
- A `ts` object has plotting functionality.
- Quick visualisation of the time trend in the monthly FTSE 100 index price

```r
autoplot(tsfe::ftse_m_ts) + theme_tq()
```

<!-- -->
]

---
class: middle

## Is a table a good visualisation?

- Quarterly earnings per share data for Carnival PLC.
- Using `DT` to create an interactive table

```r
tsfe::ni_hsales %>% datatable()
```
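
---
class: middle

## A `ts` sketch with simulated data

The `ts()` arguments introduced earlier can be checked on made-up data. The sketch below is only an illustration: the 24 simulated monthly values and the object names `z_sim`, `y_sim` and `y_2004` are assumptions, not course data. It builds a monthly series starting in January 2003 and then inspects its frequency, start, end and a one-year window.

```r
# Hypothetical data: 24 simulated monthly observations (not course data)
set.seed(123)
z_sim <- round(rnorm(24, mean = 100, sd = 10), 1)

# Monthly series: 12 observations per year, first observation in January 2003
y_sim <- ts(z_sim, frequency = 12, start = c(2003, 1))

frequency(y_sim) # observations per year
start(y_sim)     # year and month of the first observation
end(y_sim)       # year and month of the last observation

# Keep only the observations from January 2004 onwards
y_2004 <- window(y_sim, start = c(2004, 1))
length(y_2004)
```

Because `frequency = 12`, the pair `c(2004, 1)` is read as the first month of 2004, which is how `window()` picks out the second year of data.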
---
class: middle

## Rethinking visualisation using `ggplot2`

.pull-left[
* `ggplot2` is based on [The Grammar of Graphics](http://amzn.to/2ef1eWp), the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms—visual marks that represent data points.
]

.pull-right[
#### Example: Exchange rate time series

```r
tsfe::usuk_rate %>%                 # Data
  ggplot(aes(x=date, y=price )) +   # Coordinate system
  geom_line(colour="pink")          # geom
```

<!-- -->
]

---
class: middle

.your-turn[
* Use `tq_get` to download the CBOE VIX Index from `2016-01-01` using the symbol `^VIX`
* Create a time series object using the VIX index price.
* Plot this daily VIX Index Price using `autoplot`.

**Hint:** a daily financial time series does not have a regular frequency to pass to the `ts()` function, so leave this argument blank.
]

---
class: middle

# Distributional properties of asset returns

#### Why normal?

* This is a plot of a normal distribution with mean equal to zero and variance equal to one.

<img src="data:image/png;base64,#exploring-data_files/figure-html/normalplot-1.png" style="display: block; margin: auto;" />

---
class: middle

## Why normal?

* Named after Carolo Friderico Gavss, the normal (or Gaussian) distribution is the most commonly used distributional assumption in statistical analysis.

.large[This is because:]

* Easy to calculate with
* Common in nature
* Very conservative assumptions

---
class: middle

## Rethinking econometrics:
### `the importance of simulation`

- Simulation of random variables is important in applied statistics for several reasons.
- .larger[First], we use probability models to mimic variation in the world, and the tools of simulation can help us better understand how this variation plays out.
.blockquote[
Patterns of randomness are notoriously contrary to normal human thinking—our brains don’t seem to be able to do a good job understanding that random swings will be present in the short term but average out in the long run—and in many cases simulation is a big help in training our intuitions about averages and variation.
<footer>Gelman et al. 2021</footer>
]

---
class: middle

## Rethinking econometrics:
### `the importance of simulation`

.large[
- Second, we can use simulation to approximate the sampling distribution of data and propagate this to the sampling distribution of statistical estimates and procedures.
- Third, regression models are not deterministic; they produce probabilistic predictions.
]

---
class: middle

## Rethinking econometrics: `the importance of simulation`

.large[
- Simulation is the most convenient and general way to represent uncertainties in forecasts. Throughout this course and in our practice, we use simulation for all these reasons.
- In this lecture we introduce the basic ideas and the tools required to perform simulations in R.
]

---
class: middle

.discussion[
* Suppose your company is being IPO'd at a starting price of £40.
* You want to know the future price of the stock in 200 days.
* You have been told Monte Carlo simulation can help predict stock market futures.
* Can you create a number of possible future price paths and find the average price in 200 days?
* How many futures should you create?
* What assumptions should we make about these futures?
]

---
class: middle

.salt[Simulation in R
]

.panelset[
.panel[
.panel-name[custom-made R function]

- The coding for simulation becomes cleaner if we express the steps for a single simulation in an R
function.
]
.panel[
.panel-name[`replicate` function]

- For simplicity I have *hard-coded* one sample path of our IPO closing price in the previous function.
- We can then use the `replicate` function to call `path_sim` 10000 times.
]
.panel[
.panel-name[Summarising simulations]

- Simulations are a versatile way to summarise a probability model, predictions from a fitted regression or uncertainty about parameters of a fitted model (*the probabilistic equivalent of estimates and standard errors*)
- One useful way to summarise the *location* of the distribution is the `median`; the corresponding summary of the *variation* in the distribution is the *median absolute deviation standard deviation (mad sd)*.
]
.panel[
.panel-name[Median statistics]

.blockquote[
We typically prefer median-based summaries because they are more computationally stable, and we rescale the median-based summary of variation as described above so as to be comparable to the standard deviation, which we already know how to interpret in usual statistical practice.
]

```
## median = 48.69729 mad sd = 3.427117 mean = 48.8466 sd = 3.470068
```
]
]

---
class: middle

## Why normal?

.fat[
Practical reasons

* Processes that produce normal distributions include:
* Addition
* Lots of processes are approximately normal
* Product of small deviations
* Logarithms of products
* Logarithms are just magnitudes
]

---
class: middle

.fat[Ontological perspective of why normal?]

.pull-left[
<iframe width="560" height="315" src="https://www.youtube.com/embed/XTsaZWzVJ4c" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
]

.pull-right[
* Processes which add fluctuations result in dampening
* Damped fluctuations end up Gaussian
* No information left, except mean and variance.
* Can't infer process from distribution! Must do some more science.
]

---
class: middle

.acid[Epistemological perspective of why normal?]
.pull-left[
<iframe width="560" height="315" src="https://www.youtube.com/embed/lI9-YgSzsEQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
]

.pull-right[
* Know only *mean* and *variance*
* Then the least surprising and most conservative (*maximum entropy*) distribution is Gaussian.
* In terms of being conservative, it assumes the least; only the mean and variance.
* Nature likes maximum entropy distributions.
]

---
class: middle

.pull-left[
## Why normal asset returns?

* Financial time series processes considered in this course:
  * Return series
  * Financial statement information
  * Volatility processes
  * Extreme events
  * Multivariate series
]

.pull-right[
## Why assume normal asset returns?

* The normality assumption allows asset return properties to be tractable.
* Tractable mean and variance
  * They provide information about the long-term return and risk, respectively.
* Tractable symmetrical properties
  * Symmetry has important implications in holding short and long financial positions in risk management.
* Tractable kurtosis properties
  * Kurtosis is related to volatility forecasting, efficiency in estimation and tests, etc.
]

---
class: middle

## Our first engineered robot

.pull-left[
* To engineer a statistical model of asset returns we need to make some assumptions about the data story or *data generating process*.

$$
\{R_{it}|i=1,\dots,N;t=1,\dots,T\} \stackrel{i.i.d}\sim N (m_1,m_2)
$$

* A traditional assumption in financial analysis is that simple returns are independently and identically distributed (iid) as normal with a fixed mean `\(m_1\)` and variance `\(m_2\)`.
]

.pull-right.discussion[
* The previous assumption is unrealistic in a number of ways:

1. The lower bound of a simple return is -1, but a normal distribution has no lower bound.
2. 
Multiperiod simple returns `\(R_{it}[k]\)` are not normally distributed as they are the *product* of one-period returns.
3. Empirically, asset returns tend to have *heavy tails* or **positive excess kurtosis**
]

---
class: middle

## Our second statistical robot: .small[lognormal distribution]

$$
\{r_{it}|i=1,\dots,N;t=1,\dots,T\} \stackrel{i.i.d}\sim N (\mu,\sigma^2)
$$

* Another common assumption is that log returns are iid as normal with mean `\(\mu\)` and variance `\(\sigma^2\)`.
* As the sum of a finite number of iid normal random variables is normal, the multiperiod log return `\(r_t[k]\)` is also normally distributed.
* There is also no lower bound for `\(r_t\)`.
* However, the lognormal assumption is not consistent with **positive excess kurtosis**.

---
class: middle

# Empirical properties of log returns

## Are stock returns distributed normally?

* The following code loads Glencore Plc asset prices from Yahoo Finance, converts daily adjusted prices to monthly log returns, and creates a monthly time series object.

.pull-left[
.heatinline[Ugly code]

```r
glen <- tq_get("GLEN.L")
glen_m <- tq_transmute(glen, select = adjusted, mutate_fun = monthlyReturn, type="log", col_rename = "log_return")
glen_m_ts <- ts(glen_m$log_return, frequency=12,start=c(2011,5))
```
]

.pull-right[
.fancy[Piping (%>%) code is more readable?]

- `literate programming`

```r
glen <- tq_get("GLEN.L")
glen_m <- glen %>%
  tq_transmute(select = adjusted,
               mutate_fun = monthlyReturn,
               type = "log",
               col_rename = "log_return")
glen_m_ts <- glen_m$log_return %>%
  ts(frequency = 12, start = c(2011, 5))
```
]

---
class: middle

## Are stock returns distributed normally?

.panelset[
.panel[
.panel-name[**Visual Arguments**]

* We can use `ggplot` to visualise the empirical distribution, superimposing what the returns would look like if they were normally distributed.
* **Only two parameters (a mean and a variance) are required to create a hypothetical normal distribution of a returns series**
]
.panel[
.panel-name[ggplot code]

```r
glen_m %>%
  ggplot(aes(x = log_return)) +
  geom_density() +
  # overlay a normal density with the sample mean and standard deviation
  stat_function(
    fun = dnorm,
    args = list(mean = mean(glen_m$log_return), sd = sd(glen_m$log_return)),
    colour = "red")
```
]
.panel[
.panel-name[output]

<!-- -->
]
.panel[
.panel-name[Inference]

* **What patterns are revealed?**
* The normal distribution is superimposed over the density of the monthly log returns.
* Compared to the normal, the distribution of the returns has longer tails and a higher central peak.
* In statistical terms we say the distribution is leptokurtic, or fat-tailed.
]
]

---
class: middle

## Are stock returns distributed normally?

.panelset[
.panel[
.panel-name[Quantile-quantile plot]

.pull-left[
```r
tsfe::glen_m %>%
  ggplot(aes(sample = 100 * log_return)) +
  stat_qq(distribution = stats::qnorm) +
  stat_qq_line(distribution = stats::qnorm, colour = "red") +
  labs(title = "Quantile-quantile plot of Glencore stock returns")
```
]

.pull-right[
<!-- -->
]
]
.panel[
.panel-name[Inference]

* This plot compares the quantiles of a normal distribution (thinner straight line) to the quantiles of the data (thicker scatter plot).
* If the two exactly overlap then the data is probably normally distributed.
* While the returns and the normal distribution are similar between +12.5% and -12.5%, outside these limits the returns behave non-normally.
]
]

---
class: middle

### Should we care if asset returns are normally distributed variables?

.large[
- In regressions the assumption of normality of model errors is one of the least important.
- For the purpose of estimating the regression line it is barely important at all.
- Diagnostics of normality of errors are not recommended unless the model is being used to predict individual data points.
]

---
class: middle

### Should we care if asset returns are normally distributed variables?
- If the distribution of errors is of interest, perhaps because of predictive goals, this should be distinguished from the distribution of the data, `\(y\)`.
- **A regression model does not assume or require that predictors are normally distributed**
- Furthermore, the normal distribution on the outcome refers to the regression errors, not the raw data.
- Depending on the structure of the predictors, it is possible for data `\(y\)` to be far from normally distributed even when coming from a linear regression model.
- See Gelman et al. (2020), Chapter 11, for more detail.

---
class: middle

## Null (.acid[ or Nil]) hypothesis significance testing

* Null hypothesis significance tests (NHST) are models.
* We assume an underlying data story with distributional properties, which then allows us to create p-values based on a null hypothesis.
* In practice they are often misused to create 'bright line' acceptance or rejection decisions about underlying theoretical questions.
* In applied statistics their misuse is well understood.

.blockquote[
Read Bailey, D. H. & Prado, M. L. de. Finance is Not Excused: Why Finance Should Not Flout Basic Principles of Statistics. SSRN Electronic Journal (2021) doi:10.2139/ssrn.3895330.
]

---
class: middle

## Are stock returns distributed normally?

.acid[Statistical inference]

* Assuming the models underpinning the previous NHST are valid, we can reject the null that Glencore monthly log returns are normally distributed at even the 1% significance level.

.acid[Practical inference]

* Assuming normality to model the *middle* of the data is probably ok.
* Modelling extreme observations may need more distributional tools.

---
class: middle

## Heavy Tail Statistical Distributions

Student's t : very similar to the normal but wider and lower.

Stable : generalisation of the normal, stable under addition thus can be used with log returns, can capture excess kurtosis well but has infinite variance which conflicts with finance theory.

Scale mixture : a combination of a number of normals.

---
class: middle

## Rethinking time plots

.pull-left[
* One **limitation** with time plots is that the simple passage of time is not a good explanatory variable.
* There are occasional exceptions where there is a clear mechanism driving the financial time series.

.blockquote[Descriptive chronology is not causal explanation - Edward Tufte (2015) *The visual display of quantitative information*, p. 37
]
]

.pull-right[
```r
glen_m %>%
  ggplot(aes(x=date,y=log_return)) +
  geom_line()
```

<!-- -->
]

---
class: middle

## Seasonal plots

.panelset[
.panel[
.panel-name[Code]

```r
glen_m_ts %>%
  ggseasonplot(year.labels=TRUE, year.labels.left=TRUE) +
  ylab("") +
  ggtitle("Seasonal plot: Glencore returns")
```
]
.panel[
.panel-name[Output]

<!-- -->
]
.panel[
.panel-name[Inference]

.discussion[
* Data plotted against the individual "seasons" in which the data were observed. (In this case a "season" is a month.)
* Something like a time plot except that the data from each season are overlapped.
* Enables the underlying seasonal pattern to be seen more clearly, and also allows any substantial departures from the seasonal pattern to be easily identified.
* In R: `ggseasonplot()`
]
]
.panel[
.panel-name[Polar plot]

```r
glen_m_ts %>%
  ggseasonplot(polar=TRUE) +
  ylab("")
```

<img src="data:image/png;base64,#exploring-data_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]
.panel[
.panel-name[ Subseries plots]

```r
ggsubseriesplot(glen_m_ts) +
  ylab("") +
  ggtitle("Subseries plot: Glencore returns")
```

<img src="data:image/png;base64,#exploring-data_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

* Data for each season collected together in time plot as separate time series.
* Enables the underlying seasonal pattern to be seen clearly, and changes in seasonality over time to be visualized.
* In R: `ggsubseriesplot()`
]
]

---
class: middle

# Seasonal or cyclic?

## Time series patterns

Trend : pattern exists when there is a long-term increase or decrease in the data.

Seasonal : pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or day of the week).

Cyclic : pattern exists when data exhibit rises and falls that are **not of fixed period** (duration usually of at least 2 years).

---
class: middle

## Time series components

### Differences between seasonal and cyclic patterns:

* seasonal pattern constant length; cyclic pattern variable length
* average length of cycle longer than length of seasonal pattern
* magnitude of cycle more variable than magnitude of seasonal pattern

---
class: middle

## Time series patterns

<!-- -->

---
class: middle

## Time series patterns

<!-- -->

---
class: middle

## Time series patterns

<!-- -->

---
class: middle

## Seasonal or cyclic?

.fat[Differences between seasonal and cyclic patterns]

.large[
* seasonal pattern constant length; cyclic pattern variable length
* average length of cycle longer than length of seasonal pattern
* magnitude of cycle more variable than magnitude of seasonal pattern
]

.blockquote[
The timing of peaks and troughs is predictable with seasonal data, but unpredictable in the long term with cyclic data.
]

---
class: middle

# Lag plots and autocorrelation

.pull-left[
### Example: Earnings per share

```r
gglagplot(tsfe::carnival_eps_ts)
```

<img src="data:image/png;base64,#exploring-data_files/figure-html/unnamed-chunk-18-1.png" width="10cm" style="display: block; margin: auto;" />
]

.pull-right[
### Lagged scatterplots

* Each plot shows `\(y_t\)` plotted against `\(y_{t-k}\)` for different values of `\(k\)`.
* The autocorrelations are the correlations associated with these scatterplots.
]

---
class: middle

## Autocorrelation

- One of the most important data properties that financial time series models exploit
- **Covariance** and **correlation**: measure extent of **linear relationship** between two variables (`\(y\)` and `\(X\)`).

--

- **Autocovariance** and **autocorrelation**: measure linear relationship between **lagged values** of a time series `\(y\)`.

--

- .acidinline[We measure the relationship between:]

* `\(y_{t}\)` and `\(y_{t-1}\)`
* `\(y_{t}\)` and `\(y_{t-2}\)`
* `\(y_{t}\)` and `\(y_{t-3}\)`
* etc.

---
class: middle

## Autocorrelation

We denote the sample autocovariance at lag `\(k\)` by `\(c_k\)` and the sample autocorrelation at lag `\(k\)` by `\(r_k\)`. Then define

`$$c_k = \frac{1}{T}\sum_{t=k+1}^T (y_t-\bar{y})(y_{t-k}-\bar{y})$$`

.salt[and]

`$$r_{k} = c_k/c_0$$`

* `\(r_1\)` indicates how successive values of `\(y\)` relate to each other
* `\(r_2\)` indicates how `\(y\)` values two periods apart relate to each other
* `\(r_k\)` is *almost* the same as the sample correlation between `\(y_t\)` and `\(y_{t-k}\)`.

---
class: middle

## Autocorrelation

Results for first 9 lags for Carnival earnings data:

| `\(r_1\)` | `\(r_2\)` | `\(r_3\)` | `\(r_4\)` | `\(r_5\)` | `\(r_6\)` | `\(r_7\)` | `\(r_8\)` | `\(r_9\)` |
|:-----:|:------:|:-----:|:-----:|:-----:|:------:|:-----:|:-----:|:------:|
| 0.103 | -0.109 | 0.069 | 0.839 | 0.051 | -0.143 | 0.016 | 0.707 | -0.021 |

<!-- -->

---
class: middle

## Autocorrelation

.pull-left[
* `\(r_{4}\)` is higher than for the other lags. This is due to **the seasonal pattern in the data**: the peaks tend to be **4 quarters** apart and the troughs tend to be **4 quarters** apart.
* `\(r_2\)` and `\(r_6\)` are more negative than the other lags because troughs tend to be 2 quarters behind peaks.
* Together, the autocorrelations at lags 1, 2, `\(\dots\)`, make up the **autocorrelation function** or ACF.
* The plot is known as a **correlogram**
]

.pull-right[
#### ACF

```r
ggAcf(carnival_eps_ts)
```

<!-- -->
]

---
class: middle

### ACF plots

.panelset[
.panel[
.panel-name[Trend and seasonality]

- When data have a trend, the autocorrelations for small lags tend to be large and positive.
- When data are seasonal, the autocorrelations will be larger at the seasonal lags (i.e., at multiples of the seasonal frequency)
- When data are trended and seasonal, you see a combination of these effects.
]
.panel[
.panel-name[Any trends?]

<!-- -->
]
.panel[
.panel-name[Any autocorrelation?]

<!-- -->
]
.panel[
.panel-name[Inference]

.discussion[
Time plot shows clear trend and seasonality.

The same features are reflected in the ACF.

* The slowly decaying ACF indicates trend.
* The ACF peaks at lags 4, 8, 12, 16, 20, `\(\dots\)`, indicate seasonality of length 4.
]
]
]

---
class: middle

.your-turn[
* Use `tq_get()` to download the GLEN.L stock price from Yahoo Finance
* Using this series, create a time series object (hint: **use frequency=1**)
* Explore this time series for trends and autocorrelation
]

---
class: middle

# Signal and the noise

.heat[
- One important stylised fact of financial time series data is a low `\(\frac{Signal}{Noise}\)` ratio, which changes over time (is dynamic).
- The signal is the *sense* you want your model to capture and predict; the noise is the *nonsense* you do not want your model to capture as it is unpredictable.
]

.blockquote[
Noise makes trading in financial markets possible, and thus allows us to observe prices for financial assets. [But] noise also causes markets to be somewhat inefficient… Most generally, noise makes it very difficult to test either practical or academic theories about the way that financial or economic markets work.
We are forced to act largely in the dark
<footer>Fischer Black, Noise, Journal of Finance, 41(3) (1986), p. 529</footer>
]

---
class: middle

## Example: White noise

<!-- -->

---
class: middle

## Example: White noise

| Autocorrelation | Value |
|:---------------:|------:|
| `\(r_{1}\)` | 0.13 |
| `\(r_{2}\)` | -0.03 |
| `\(r_{3}\)` | 0.07 |
| `\(r_{4}\)` | -0.10 |
| `\(r_{5}\)` | 0.04 |
| `\(r_{6}\)` | 0.34 |
| `\(r_{7}\)` | 0.15 |
| `\(r_{8}\)` | 0.10 |
| `\(r_{9}\)` | -0.10 |
| `\(r_{10}\)` | -0.07 |

- Sample autocorrelations for white noise series
- We expect each autocorrelation to be close to zero.

---
class: middle

## Sampling distribution of autocorrelations

Sampling distribution of `\(r_k\)` for white noise data is asymptotically `\(N(0, 1/T)\)`.

* About 95% of all `\(r_k\)` for white noise should lie within `\(\pm 1.96/\sqrt{T}\)`.
* If this is not the case, the series is probably not WN.
* Common to plot lines at `\(\pm 1.96/\sqrt{T}\)` when plotting ACF.

---
class: middle

.your-turn[
You can compute the daily changes in the Google stock price using

```r
dgoog <- diff(goog)
```

Does `dgoog` look like white noise?
]

---
class:

## Financial Data Forecastability

- Predictability of an event or quantity depends on several factors

1. How well we understand the factors that contribute to it
2. How much data are available
3. Whether the forecast can affect the thing we are trying to forecast

## Prediction and EMH

- Crudely, the efficient market hypothesis (EMH) implies that returns from speculative assets are *unforecastable*
- The overpowering logic of the EMH is:
- **If returns are forecastable, there should exist a money machine producing unlimited wealth**
- Based on the random walk theory, changes in price are defined as white noise as follows:

`\begin{align*} p_t &= p_{t-1} + a_t \\ p_t-p_{t-1} &= a_t, \quad \text{where } a_t \text{ is white noise} \end{align*}`

---
class: middle

.salt[
- High quality models identify the signal from the noise in financial data.
- The **signal** is the regular pattern that is likely to repeat.
- The **noise** is the irregular pattern which occurs by chance and is unlikely to repeat.
- Overfitting or data snooping can result in your model capturing both **signal** and **noise**.
- Overfitted models usually produce poor predictions and inferences.
]
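
---
class: middle

## Checking for white noise: a sketch

The `\(\pm 1.96/\sqrt{T}\)` bounds and the random walk definition from the earlier slides can be tried out on simulated data. This is a minimal sketch, not course code: the Gaussian shocks, the seed, the sample size of 250 and the starting price of 40 are all assumptions made for illustration.

```r
# Assumed inputs (illustrative only)
set.seed(2023)
n_obs <- 250
a <- rnorm(n_obs)   # white noise shocks a_t (Gaussian by assumption)

# Random walk price: p_t = p_{t-1} + a_t, started at an assumed price of 40
p <- 40 + cumsum(a)

# First differences of the price recover the shocks (from the second one on)
dp <- diff(p)
same_as_shocks <- isTRUE(all.equal(dp, a[-1]))

# Sample autocorrelations r_1, ..., r_20 of the price changes (lag 0 dropped)
r_k <- as.numeric(acf(dp, lag.max = 20, plot = FALSE)$acf)[-1]

# Approximate 95% bounds for white noise, using T = length(dp)
bound <- 1.96 / sqrt(length(dp))
share_outside <- mean(abs(r_k) > bound)

cat("bound:", round(bound, 3), "\n")
cat("share of r_k outside the bounds:", share_outside, "\n")
```

For white noise roughly 5% of the `\(r_k\)` are expected to fall outside the bounds, so one or two exceedances in 20 lags is not evidence of signal. Applying the same check to the price level `p` instead of `dp` would typically show the slowly decaying ACF of a trending series.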